---
layout: page
title: Script 6 - Hypothesis Testing
permalink: /scripts/script6/
parent: R Scripts
nav_order: 6
---
Run all of the code in this script in your own local R script - running the code yourself is crucial for familiarising yourself with the language. For your own sake, I strongly recommend that you always annotate your code, including its output or result, unless it is trivial or already discussed below. Always make sure to be explicit about output and substantive results when responding to the task!
Let’s start with a brief recap of the data we used before (and we’ll be using this time)!
Levin provides evidence for the electoral consequences of attempts by great powers to intervene in a partisan manner in another country’s elections. Based on the assumption that foreign actors face strong incentives (motive and opportunity) to intervene in competitive elections, he argues that electoral interventions systematically increase the electoral chances of the supported candidate. Overt interventions, however, tend to be more effective than covert interventions. The types of intervention range from providing funding to the preferred side’s campaign to publicly threatening to cut off foreign aid. Between 1946 and 2000, the United States and the USSR/Russia intervened in this manner 117 times.
The dataset Levin.dta contains the following variables of interest:
| Variable Name | Variable Description |
|---|---|
| country | Country name with a competitive election |
| year | Election year |
| incumbent_vote | Vote share of the incumbent’s party (in parliamentary systems) or of the incumbent party’s presidential candidate (in presidential/semi-presidential systems) |
| elect_int | Partisan electoral intervention (USA and Russia): 1 if an intervention favours the incumbent, -1 if it favours a challenger, and 0 when no intervention occurs |
| prev_vote | Previous vote share of the incumbent (Previous vote) |
| rgdplpcgw | Real GDP per capita growth rate (Growth) |
| com_openk | Trade as a percentage of GDP in constant terms (Trade Openness) |
| prezelection | Dummy variable measuring whether a particular election is a presidential election (1) or not (0) (Presidential Election) |
| reelection_prez | A president is running for re-election in a presidential or semi-presidential system: Yes = 1, No = 0 (Re-election) |
| lfrag_com | Effective number of parties or candidates contesting the election (Effective num. of Parties) |
| lgcm_rgdpl_pc | (Logged) GDP per capita in thousand 2005 constant U.S. dollars |
| africa | Regional dummy: Africa |
| asia | Regional dummy: Asia |
| eastcentraleurope | Regional dummy: Central and E. Europe |
| lamericacarribean | Regional dummy: Latin America and Caribbean |
| coldwar | Cold War dummy |
In the previous and current session, we discuss probability, probability distributions and the accuracy of estimates, including standard errors, confidence intervals for means and proportions, hypothesis testing, type I and II errors, and p-values. We will use this script to familiarise ourselves with these concepts and the respective commands in R. We start by looking at issues surrounding hypothesis testing and sampling using Levin’s study on great power electoral interventions.
Suppose we draw all possible samples of size \(n\) from a given population. Suppose further that we compute a statistic (e.g. a mean, standard deviation) for each sample. The probability distribution of this statistic is called a sampling distribution (e.g. sampling distribution of the mean). And the standard deviation of this statistic is called the standard error.
Note: the standard deviation measures the amount of variability, or dispersion, of a set of data around its mean; whereas the standard error (of the mean) measures how far the sample mean is likely to be from the true population mean.
Why do we get an approximation of a normal distribution?
The central limit theorem states that the sampling distribution of the mean of any random variable will be normal or nearly normal, provided the sample size is large enough.
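To see the theorem at work, here is a minimal simulation sketch using hypothetical exponential data (not the Levin dataset): even though the population is clearly skewed, the means of repeated samples are approximately normally distributed.

```r
# Central limit theorem in action (illustrative data, not from Levin.dta):
# the exponential distribution is strongly skewed, yet the means of
# repeated samples drawn from it pile up in a roughly normal shape.
set.seed(123)
sample_means <- replicate(5000, mean(rexp(50, rate = 1)))
mean(sample_means)  # close to the population mean of 1
sd(sample_means)    # close to the theoretical SE, 1/sqrt(50) (about 0.14)
hist(sample_means, breaks = 40,
     main = "Sampling distribution of the mean")
```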
For the estimate of the mean to be useful, we must have some idea of how precise it is. That is, how close to the population mean is the sample mean estimate likely to be? This is most commonly done by calculating a confidence interval around the sample mean. Confidence intervals are calculated in such a way that, under (hypothetical) repeated sampling, the population parameter of interest (e.g. the mean or median) is contained in the confidence interval with a given probability. So, for instance, the population mean lies within the 95% confidence interval in 95% of random samples.
Note: This is not the same as saying that a 95%
confidence interval contains the population mean with a probability of
.95, although this is a common misinterpretation.
95% confidence means that we used a procedure that ‘works’ 95%
of the time to estimate that specific interval. That is, 95% of all
intervals produced by the procedure will contain the true value of the
population parameter we seek to estimate. For any one
particular interval, the true population parameter of interest is either
inside the interval or outside the interval.
The confidence interval centers on the estimated statistic
(e.g. mean) and it extends symmetrically around the point estimate by a
factor called the margin of error. We can write this as:
CI= point estimate ± margin of error. Most often, you will
see either the 95% or 99% confidence interval.
How can we calculate confidence intervals?
Steps:
• The standard error of the estimate = the standard deviation divided by the square root of the number of observations
• \(SE = sd/ \sqrt{n}\)
• Margin of error = the standard error multiplied by the z-score associated with the confidence level of our choice
• Recall, a z-score is the number of standard deviations a data point falls away from the mean
Note: When you have multiple samples and want to describe the standard deviation of one of those sample means (that is, their standard error), you need the z-score that tells you where the score lies on a normal distribution curve. In other words, it shows how many standard errors there are between the sample mean and the (unknown/estimated) population mean. Recall:
• A z-score of 1 is 1 standard deviation above the mean value of the reference population (a population whose known values have been recorded)
• A z-score of 2 is 2 standard deviations above the mean value of the reference population
• A z-score of zero tells you the value is exactly the mean, while a score of +3 tells you that the value is much higher than average.
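The z-scores used for confidence intervals are simply quantiles of the standard normal distribution; a quick sketch of where they come from (the example vector is purely illustrative):

```r
# The CI multipliers are quantiles of the standard normal distribution:
qnorm(0.975)  # about 1.96, used for a 95% CI (2.5% in each tail)
qnorm(0.95)   # about 1.645, used for a 90% CI
qnorm(0.995)  # about 2.58, used for a 99% CI

# And z-scores for individual data points (illustrative values):
x <- c(40, 50, 60)
(x - mean(x)) / sd(x)  # number of standard deviations from the mean
```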
Let’s give this a shot in R:
#Let's first load our data:
#setwd
setwd("~")
library(foreign)
Levin <- read.dta("Levin.dta")
# Step 1: Standard Error of our dependent variable
IV.se <- sd(Levin$incumbent_vote)/sqrt(length(Levin$incumbent_vote))
# Step 2: Confidence Intervals: Point estimate +- multiple of the standard error (z-score)
# The z-score of a 95% Confidence Interval is 1.96
# for a 90% Confidence Interval = 1.645
# for a 99% Confidence Interval = 2.58
# Step 3: Calculate the upper and lower bounds of the confidence intervals
ci_low <-mean(Levin$incumbent_vote) - IV.se*1.96
#and the upper estimate as
ci_upper <-mean(Levin$incumbent_vote) + IV.se*1.96
ci.an <- c(ci_low, ci_upper) # c combines its arguments to form a vector
ci.an
It’s important to understand how confidence intervals are calculated - and how you can calculate them on your own. Luckily, we usually don’t have to do this by hand, as R can simply calculate them for us:
## Calculation of CIs using R-package "Rmisc"
#install.packages("Rmisc")
library(Rmisc)
# use the CI() command to compute the confidence interval for the mean of incumbent_vote
CI(Levin$incumbent_vote, ci=0.95)
So, given the 95% confidence interval that we estimated in R, we see the range of values where the population mean is likely placed. A rough rule of thumb is that the 95% confidence interval is the mean plus or minus twice the standard error. If you’re not that much into rules of thumb, it doesn’t hurt to be aware that the exact multiplier is 1.96.
Note, the size of our confidence intervals depends on the sample size. With a small sample size, the confidence interval is rather wide. In other words, the larger our sample, the smaller our standard error, and therefore the narrower the confidence interval and the more precise the estimate of the population mean. This should seem intuitive - as our sample size approaches the population size, uncertainty greatly diminishes.
Usually, the interpretation of a 95% confidence interval is that under repeated samples or experiments, 95% of the resulting intervals would contain the unknown parameter in question. However, for binomial data, the actual coverage probability, regardless of method, usually differs from that interpretation. This is because of the discreteness of the binomial distribution, which produces only a finite set of outcomes, meaning that coverage probabilities are subject to discrete jumps and that some levels cannot be achieved. Thus, to avoid confidence intervals that include negative values or values larger than one, we calculate the standard error from the sample proportion: \(SE = \sqrt{p(1-p)/n}\).
Let’s assume that we are interested in whether one specific election was a presidential election or not because we think that this might have an overall effect on our dependent variable. How do we calculate the confidence interval for this variable?
Steps:
# Step 1:
table(Levin$prezelection)
p.pelect <- 286/(565+286) # number of cases with presidential elections divided
# by the total number of observations
# Step 2: Getting the standard error
se.pelection <- sqrt(p.pelect*(1-p.pelect)/length(Levin$prezelection))
# Step 3: Calculate the lower and upper bounds of the normal distribution
ci_low <- mean(Levin$prezelection) - 1.96*se.pelection
ci_up <- mean(Levin$prezelection) + 1.96*se.pelection
ci.an <- c(ci_low, ci_up)
ci.an
# Alternatively, using the CI() function from the Rmisc package
CI(Levin$prezelection[!is.na(Levin$prezelection)], ci=0.95)
Having just learned how to calculate confidence intervals, we should take a moment to look at the relationship between sample size and confidence intervals. This is important to understand because it shows that standard errors, t-statistics, p-values, and confidence intervals are all related.
## Sample size of random samples
set.seed(3)
sample1 <- Levin[sample(nrow(Levin),100,replace=FALSE),]
set.seed(3)
sample2 <- Levin[sample(nrow(Levin),500,replace=FALSE),]
set.seed(3)
sample3 <- Levin[sample(nrow(Levin),800,replace=FALSE),]
CIsample1 <- CI(sample1$incumbent_vote, ci=0.95)
CIsample2 <- CI(sample2$incumbent_vote, ci=0.95)
CIsample3 <- CI(sample3$incumbent_vote, ci=0.95)
ci.sample <- rbind(CIsample1, CIsample2, CIsample3)
ci.sample
Increasing the sample size decreases the width of confidence intervals, because it decreases the standard error. The larger the sample size the more information we have - and our uncertainty decreases. Note that this is crucial for substantive statements based on regression analysis. The larger the sample size, the more likely we are to find ‘statistically significant’ differences - even if they are very minor, sometimes negligible, in magnitude.
If we find that two groups have different means for a given variable, how do we know whether this is due to chance in the particular sample that we have drawn - or if there is reason to believe that the two groups are actually different with respect to this particular characteristic? What we want to know is the probability that the means of the two groups are different in the population. Typically, when there is sufficient evidence in our sample to say that there is less than a five per cent chance that we got this result purely by chance when the difference in the underlying population is actually zero, we say that there is a statistically significant difference in means. Recall that ‘statistical significance’ is not the same as substantive significance! Even very small differences are statistically significant if we are confident that they reflect a real difference in the population - but they may not matter for our purposes.
The t.test function allows us to conduct tests on the
difference in means. We will use it to test the difference of the
average incumbent vote share during and after the Cold War
period.
The following formula represents the t-statistic. A t-statistic essentially considers the difference in means between the groups; the spread of the variable within each group (and overall); and the number of observations per group:
\[ t=\frac{\overline{x_1}-\overline{x_2}}{\sqrt{\frac{s_1^2}{n_1}+\frac{s_2^2}{n_2}}} \]
Now let’s see how to compute the t-statistic with R:
ttest <- t.test(Levin$incumbent_vote ~ Levin$coldwar)
ttest
ttest$statistic
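To connect the formula above to the t.test() output, we can compute the Welch t-statistic by hand and check it against t.test(). So that the sketch is self-contained, it uses R’s built-in mtcars data (mpg by transmission type, am) as a stand-in for the Levin variables:

```r
# Welch t-statistic by hand, using built-in mtcars data as a stand-in:
g1 <- mtcars$mpg[mtcars$am == 0]  # automatic transmission
g2 <- mtcars$mpg[mtcars$am == 1]  # manual transmission
t_manual <- (mean(g1) - mean(g2)) /
  sqrt(var(g1) / length(g1) + var(g2) / length(g2))
t_manual
t.test(mtcars$mpg ~ mtcars$am)$statistic  # matches the manual value
```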
The output might seem a bit complex but it’s actually quite straightforward and insightful:
• The first thing to note from the output is that the mean vote share in the post-Cold War period is 3.9 percentage points higher than in the Cold War period.
• The p-value and the confidence intervals indicate whether the null hypothesis (that there is no difference between the cold war and post cold war period regarding vote share) is likely to hold or not. The p-value is the probability of observing a value of the test statistic that is as or more extreme than what was observed in the sample, under the assumption that the null hypothesis is true. We get a small p-value of 0.001, which leads us to reject the null hypothesis that there is no difference between the two groups. In other words, we are extremely confident that the difference which we observe between our two groups is not random - there is a real difference in the population.
Note: A simple comparison of means between two groups only provides an unbiased estimate of a causal effect when the setting is quasi-random, meaning there are no possible confounders. This is unlikely when we have observational data, which is usually the case in social sciences such as Economics, Political Science and IR. That is why it is so important to understand regression outputs.
Now we know everything we need and are set! Let’s estimate a multivariate model and make sure to go through all statistics R provides us with step by step. There’s a lot to take out from a single regression.
Levin.model <- lm(incumbent_vote ~ elect_int
+ prev_vote + rgdplpcgw
+ com_openk + prezelection
+ reelection_prez + lfrag_com
+ lgcm_rgdpl_pc + africa
+ asia + eastcentraleurope
+ lamericacarribean,
data=Levin)
# to see the regression results in detail, use summary(thenameofyourmodel)
summary(Levin.model)
Formula Call: the first item shown in the output is the formula R used to fit the data.
Residuals: The next item in the model output refers to the residuals. Residuals are essentially the difference between the actual observed response values of vote share and the response values that the model predicted based on our explanatory variables. The residuals section of the model output breaks them down into 5 summary points. When assessing how well the model fits the data, you should look for a symmetrical distribution across these points around the mean value of zero (0). In our example, we can see that the distribution of the residuals does not appear to be strongly symmetrical. That means that the model predicts certain points that fall far away from the actual observed data points. We could take this further and consider plotting the residuals to see whether they are normally distributed.
Coefficient - Estimate: The intercept, in our example, is the expected value of vote share given that all explanatory variables equal 0 (remember that this includes the category coded as zero for binary variables!). In other words, the predicted vote share of incumbents is 24 percent when all explanatory variables are 0 (which statistically makes sense - but this doesn’t mean it makes sense substantively. Think of predicted income at age 0, for instance).
How would you interpret the slope coefficients? For instance, in our example, we see that the average predicted vote share for the incumbent is 1.3 percentage points lower in presidential elections than in non-presidential elections. In other words, we can say that the predicted average vote share for incumbents in presidential elections versus non-presidential elections varies by 1.3 percentage points - which equals our coefficient estimate.
Standard Error: The coefficient standard error is an estimate of the standard deviation of the coefficient’s sampling distribution. It measures how much the coefficient estimate would vary, on average, if we were to run the model again and again on repeated samples. It is thus an estimate of the uncertainty surrounding the magnitude of the effect, i.e. the strength of the relationship between each covariate and the outcome variable. If the standard error is large, then the effect size will have to be stronger for us to be able to be sure that it’s a real effect and not just a random artifact. We’d ideally want a low standard error relative to its respective coefficient. The interpretation of standard errors in regression output is as follows: in 95% of surveys or “repeated samples”, the difference between our estimate and the true coefficient value is less than 1.96 times the standard error of the estimate. As you have seen, standard errors can also be used to compute confidence intervals and to statistically test the hypothesis that a relationship exists between our dependent and each respective independent variable.
Note: Recall that the margin of error gives you a sense of how much the estimate would vary across models based on repeated sampling. The standard error in regression output plays the same role: in 95% of repeated samples (or surveys), the difference between our estimate and the true value is less than approximately 1.96-times the standard error. You may have noticed now that once you know the standard error, you can calculate confidence intervals, and vice-versa.
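As a sketch of that relationship, a coefficient’s 95% confidence interval can be rebuilt by hand from its estimate and standard error and compared with confint(). The model below, fitted on R’s built-in mtcars data, is purely illustrative; note that lm() uses the t-distribution, so qt() replaces 1.96 in small samples:

```r
# Rebuilding a coefficient's 95% CI as estimate +/- t * SE
# (illustrative model on built-in data):
m <- lm(mpg ~ wt, data = mtcars)
est <- coef(summary(m))["wt", "Estimate"]
se  <- coef(summary(m))["wt", "Std. Error"]
tcrit <- qt(0.975, df = df.residual(m))  # t-quantile, ~1.96 for large n
c(est - tcrit * se, est + tcrit * se)
confint(m)["wt", ]  # matches the manual calculation
```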
T-statistic or t-value: The t-value is an estimate of “how extreme your estimate is”, relative to the standard error (assuming a normal distribution, centred on the value suggested by the null hypothesis). Technically, t-values are based on Student’s t-distribution, which approximates a normal distribution with increasing degrees of freedom and, practically, is equivalent to a normal distribution with degrees of freedom greater than about 30. In turn, the t-distribution poses a higher threshold than the normal distribution when the number of observations is very low.
More specifically, under the null hypothesis usually used in OLS, the t-value is a measure of how many standard errors our coefficient estimate is away from 0. The greater the t-value, the stronger the evidence against the null hypothesis (here, no relationship between presidential elections and vote share). The closer to 0, the more likely it is that there is in fact no relationship between the independent variable and the outcome of interest. In our example, the t-value for the presidential-election dummy is relatively close to zero, which suggests that there is no relationship. Now, what would be the probability of obtaining this coefficient estimate (or a larger one) if we took another sample from the same population?
P-value or Pr(>t): The p-value indicates the probability of observing a coefficient at least as extreme as our estimate in a sample if the true coefficient in the population was zero. In other words, it is the probability of obtaining an effect that is at least as large (i.e. as far from 0) as the estimate in our sample data, assuming that the true effect is zero. “Assuming that the true effect is zero” is what we usually refer to as the “null hypothesis”. The p-value evaluates how well the sample data support the idea that the null hypothesis holds. In other words, it measures the compatibility of the sample with the null hypothesis (and not the truth of the hypothesis!). A small p-value indicates that it is unlikely that we would observe a relationship between the predictor (presidential elections) and outcome (vote share) variables only due to chance. Typically, a p-value of 5% or less is a conventional cut-off point. Yet, a p-value of 0.05 means that if we repeated this study, we would wrongly reject the null hypothesis 5 out of 100 times.
In our example, we observe a large p-value, so we cannot conclude that there is a relationship between presidential elections and incumbent vote share. We do not reject the null hypothesis.
Residual Standard Error: The residual standard error is a measure of the quality of a linear regression fit. Every linear model is assumed to contain an error term (\(\epsilon\)). Due to the presence of this error term, we are not capable of perfectly predicting our dependent variable (vote share) from our independent variables. The residual standard error is the average amount by which the dependent variable deviates from the regression line. In our example, the actual vote share can, on average, deviate from the regression line by approximately 9 percentage points. In other words, given that the average vote share for incumbents is 24.4 and that the residual standard error is 9.25, any prediction would still be off by roughly 37.9%. Our residual standard error was calculated with 685 degrees of freedom. Degrees of freedom are the number of data points that went into estimating the parameters, after accounting for the number of parameters themselves (restrictions). In our case, we had 698 data points and 13 parameters. 243 observations were deleted due to missing values.
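The residual standard error and the residual degrees of freedom can be extracted directly from any fitted model; an illustrative sketch with R’s built-in mtcars data:

```r
# Extracting the residual standard error and degrees of freedom
# (illustrative model on built-in data):
m <- lm(mpg ~ wt, data = mtcars)
sigma(m)                     # residual standard error
sigma(m) / mean(mtcars$mpg)  # rough "percentage error" of predictions
df.residual(m)               # n minus number of estimated parameters
```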
Multiple R-Squared and Adjusted R-squared: The \(R^2\) statistic provides a measure of how well the model is fitting the actual data. It takes the form of a proportion of variance. \(R^2\) is a measure of the linear relationship between our dependent variable (vote share) and our independent variables. It always lies between 0 and 1 (i.e.: a number near 0 represents a regression that does not explain the variance in the response variable well and a number close to 1 does fully explain the observed variance in the response variable). In our example, we get an \(R^2\) of 0.5491. This means, roughly 55% of the variance found in vote share can be explained by our independent variables. In multiple regression settings, the \(R^2\) will always increase as more variables are included in the model. That’s why the adjusted \(R^2\) is the preferred measure as it adjusts for the number of variables used for the model.
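R-squared can also be reconstructed by hand as the share of explained variance, which makes the definition concrete; again an illustrative sketch on built-in data rather than the Levin model:

```r
# R-squared = 1 - SS_residual / SS_total (illustrative model on
# built-in data):
m <- lm(mpg ~ wt + hp, data = mtcars)
r2_manual <- 1 - sum(residuals(m)^2) /
  sum((mtcars$mpg - mean(mtcars$mpg))^2)
r2_manual
summary(m)$r.squared      # identical to the manual calculation
summary(m)$adj.r.squared  # penalised for the number of predictors
```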
F-statistic: The F-statistic is a good indicator of whether there is a relationship between our independent variables and the dependent variable. Simply speaking, it tells us whether our entire model provides a statistically significant improvement in explaining the dependent variable relative to an intercept-only model (i.e. relying only on the mean of the dependent variable). The further the F-statistic is from 1, the better. However, how much larger than 1 the F-statistic needs to be depends on both the number of data points and the number of predictors. Generally, when the number of data points is large, an F-statistic only a little larger than 1 is already sufficient to reject the null hypothesis (H0: there is no relationship between the dependent and independent variables). Conversely, when the number of data points is small, a large F-statistic is required to ascertain that there may be a relationship. In our example, the F-statistic is 69.51, which is clearly larger than 1 given the size of our data.
What about the confidence intervals? Unlike in Stata, regression outputs in R do not contain confidence intervals - but it is always useful to have a look at them because it helps to determine whether there might be issues with our regression. Also, if you want to visualize your outcome in a graph, it is always advisable to include confidence intervals as they convey important information and tell you a lot about the accuracy of the estimate given a specific value of \(X\)!
# to compute the CIs for regression coefficients
confint(Levin.model)